library(tidyverse)
library(plotly)
data <- read_csv('./gapminder_clean.csv') %>%
select(-1) %>%
rename(
co2em = `CO2 emissions (metric tons per capita)`,
popden = `Population density (people per sq. km of land area)`,
lifeExp = `Life expectancy at birth, total (years)`,
energy = `Energy use (kg of oil equivalent per capita)`,
)
data1962 <- data %>%
filter(Year == 1962) %>%
select(gdpPercap, co2em) %>%
drop_na()
ggplot(data = data1962) +
geom_point(mapping = aes(
x = gdpPercap,
y = co2em)) +
labs(x = "GDP per capita", y = "CO2 emissions per capita (metric tons)")
cor.test(data1962 %>% pull(gdpPercap), data1962 %>% pull(co2em))
##
## Pearson's product-moment correlation
##
## data: data1962 %>% pull(gdpPercap) and data1962 %>% pull(co2em)
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8934697 0.9489792
## sample estimates:
## cor
## 0.9260817
corrs <- data %>%
group_by(Year) %>%
select(Year, gdpPercap, co2em) %>%
drop_na() %>%
summarise(correlation = cor(gdpPercap, co2em))
maxi <- lapply(corrs, max)
The strongest correlation is 0.9387918 in the year 2007.
max_em_year_data <- data %>%
filter(Year == maxi$Year) %>%
select(gdpPercap, co2em, pop, continent, `Country Name`) %>%
drop_na()
fig <- ggplot(data = max_em_year_data) +
geom_point(aes(
x = gdpPercap,
y = co2em,
size = pop,
color = continent,
text = paste("Country: ", `Country Name`,
"\nGDP: ", gdpPercap,
"\nCO2 emissions: ", co2em))) +
xlab("GDP per capita") +
ylab("CO2 emissions per capita (metric tons)") +
ggtitle(str_glue("GDP vs CO2 emissions per capita in ", maxi$Year))
ggplotly(fig, tooltip = "text")
Can we perform ANOVA?
Let’s observe how normal the data is.
data_energy <- data %>%
select(continent, energy) %>%
drop_na()
ggplot(data_energy, mapping = aes(sample = energy)) +
facet_grid(continent ~ .) +
stat_qq() +
stat_qq_line()
Africa, Oceania and Europe look approximately normal, but the others do not. Notice that we can apply the logarithm to the energy usage data because they are all positive. We get an empty tibble when filtering for energy values less than or equal to zero.
data_energy %>%
filter(energy <= 0)
## # A tibble: 0 x 2
## # … with 2 variables: continent <chr>, energy <dbl>
Let’s try the data transformation.
data_energy_t <- data_energy %>%
mutate(energy_transformed = log(energy))
ggplot(data_energy_t, mapping = aes(sample = energy_transformed)) +
facet_grid(continent ~ .) +
stat_qq() +
stat_qq_line()
The data look more normal now. Even though they are not perfect, we have a relatively large sample size for each group, except Oceania. Oceania, though, looks the most normal out of all of them.
data_energy_t %>%
group_by(continent) %>%
summarise(count = n())
## # A tibble: 5 x 2
## continent count
## <chr> <int>
## 1 Africa 199
## 2 Americas 188
## 3 Asia 185
## 4 Europe 256
## 5 Oceania 20
The data come from different continents and are at least 5 years apart. Assume the data are independent due to the space and time separation.
How do the variances look?
ggplot(data_energy_t) +
geom_boxplot(mapping = aes(energy_transformed, continent)) +
xlab("Log of energy usage (kg of oil equivalent per capita)") +
ylab("Continent")
The variances of the transformed data look roughly the same. Oceania stands out as having a smaller variance than the others. Despite this, we will continue.
test_results <- aov(energy_transformed ~ continent, data = data_energy_t)
summary.aov(test_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 4 342.8 85.70 117.7 <2e-16 ***
## Residuals 843 613.9 0.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.0001. While we went through our assumptions, we saw that the transformed data were not exactly normal, and their variances were not exactly equal. Taking that into account, I still think we can reject the null hypothesis. That is, we reject that the log of the group means are equal. Since the logarithm is a monotone increasing function, we also reject that the group means are equal.
TukeyHSD(test_results)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = energy_transformed ~ continent, data = data_energy_t)
##
## $continent
## diff lwr upr p adj
## Americas-Africa 0.62609440 0.3888318 0.8633570 0.0000000
## Asia-Africa 0.53390365 0.2956539 0.7721534 0.0000000
## Europe-Africa 1.61347063 1.3930062 1.8339350 0.0000000
## Oceania-Africa 1.96990907 1.4226918 2.5171263 0.0000000
## Asia-Americas -0.09219075 -0.3337751 0.1493937 0.8351479
## Europe-Americas 0.98737623 0.7633124 1.2114401 0.0000000
## Oceania-Americas 1.34381467 0.7951374 1.8924920 0.0000000
## Europe-Asia 1.07956698 0.8544581 1.3046759 0.0000000
## Oceania-Asia 1.43600542 0.8869005 1.9851103 0.0000000
## Oceania-Europe 0.35643844 -0.1851867 0.8980636 0.3747658
For every pair except for (Asia, Americas) and (Oceania, Europe), we reject the null hypothesis that the logarithm of the energy usage means are the same.
data_popden <- data %>%
group_by(`Country Name`) %>%
select(`Country Name`, popden) %>%
summarise(avg_popden = mean(popden, na.rm = TRUE)) %>%
arrange(desc(avg_popden))
num_countries_shown <- 20
ggplot(data = head(data_popden, n = num_countries_shown)) +
geom_bar(
mapping = aes(x = avg_popden, y = reorder(`Country Name`, avg_popden)),
stat = "identity") +
xlab("Average population density (people per sq. km of land)") +
ylab("") +
ggtitle(str_glue(num_countries_shown, " most population dense countries 1962-2007"))
The country with the highest average population density between 1962 and 2007 is Macao SAR, China.
data_lifeExp <- data %>%
select(`Country Name`, Year, lifeExp) %>%
drop_na() %>%
group_by(`Country Name`) %>%
filter(any(Year == 1962)) %>%
mutate(lifeExpSince1962 = lifeExp - lifeExp[Year == 1962]) %>%
ungroup()
countries_highest <- data_lifeExp %>%
group_by(`Country Name`) %>%
mutate(lifeExpTotalChange = lifeExp[Year == 2007] - lifeExp[Year == 1962]) %>%
summarise(lifeExpChange = max(lifeExpTotalChange)) %>%
arrange(desc(lifeExpChange))
cutoff_lifeExpChange <- min(head(countries_highest, n = 10)$lifeExpChange)
data_lifeExpTop <- data_lifeExp %>%
group_by(`Country Name`) %>%
filter(lifeExp[Year == 2007] - lifeExp[Year == 1962] >= cutoff_lifeExpChange)
fig <- ggplot(data = data_lifeExpTop) +
geom_line(mapping = aes(
x = Year,
y = lifeExpSince1962,
color = `Country Name`)
) +
ylab("Life expectancy at birth since 1962 (years)")
ggplotly(fig)
The Maldives had the highest increase in life expectancy at birth from 1962 to 2007.